Summary

Seeing that there is already a lot of EDA workbooks on the site, I thought I’d try to add some new metrics and plot types for everyone to enjoy. For such an open-ended task with so many interesting things to look at, I’d thought I’d focus on looking at a few different analysis questions.

Setup

## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.1
## v ggplot2 2.2.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## Parsed with column specification:
## cols(
##   `Donor ID` = col_character(),
##   `Donor City` = col_character(),
##   `Donor State` = col_character(),
##   `Donor Is Teacher` = col_character(),
##   `Donor Zip` = col_character()
## )
## Parsed with column specification:
## cols(
##   `Project ID` = col_character(),
##   `Donation ID` = col_character(),
##   `Donor ID` = col_character(),
##   `Donation Included Optional Donation` = col_character(),
##   `Donation Amount` = col_double(),
##   `Donor Cart Sequence` = col_integer(),
##   `Donation Received Date` = col_datetime(format = "")
## )
## Parsed with column specification:
## cols(
##   `School ID` = col_character(),
##   `School Name` = col_character(),
##   `School Metro Type` = col_character(),
##   `School Percentage Free Lunch` = col_integer(),
##   `School State` = col_character(),
##   `School Zip` = col_character(),
##   `School City` = col_character(),
##   `School County` = col_character(),
##   `School District` = col_character()
## )

Donators’

As noted, first we will look at the characteristics of the donors. Firstly we create a master table using the donations and donors tables to produce metrics at the donor level of analysis to allow slicing and dicing further on.

Setup master table

df_donations_summary = df_donations %>%
  group_by(`Donor ID`) %>%
  arrange(`Donation Received Date`) %>%
  mutate(time_between_donations = round(
    as.numeric(`Donation Received Date`-lag(`Donation Received Date`), units = 'days'))) %>%
  summarise(num_projects = n_distinct(`Project ID`), 
            num_donations = n(), 
            sum_donations = sum(`Donation Amount`),
            first_donation = min(`Donation Received Date`),
            last_donation = max(`Donation Received Date`),
            mean_donation_gap = mean(time_between_donations, na.rm = TRUE),
            max_donation_gap = max(time_between_donations, na.rm = TRUE)) %>%
  mutate(repeat_donations = if_else(num_donations > 1, 1, 0), 
         repeat_projects = if_else(num_projects > 1, 1, 0),
         average_donation = sum_donations/num_donations) %>%
  left_join(df_donors, by = "Donor ID")

Distribution of donors

The brief suggests the organisation are interested in marketing new projects to existing donors, as a result

df_donations_dist = df_donations_summary %>%
  group_by(num_projects) %>%
  summarise(num_donors = n(), sum_donations = sum(sum_donations)) %>%
  mutate(average_donation = sum_donations/num_donors)

df_donations_dist %>%
  hchart('column', hcaes(x = num_projects, y = num_donors)) %>%
  hc_navigator(enabled = TRUE, xAxis = list(labels = list(enabled = FALSE)))

Unfortunately the massive peak at 1 and the long tail dominates the plot - although this does show us the vast majority of donors only donate to one project and that we have some golden donors. Moving slider along the bottom lets us focus on other specific ranges, however for convenience let’s focus on an area of interest.

df_donations_dist %>%
  filter(num_projects< 10) %>%
  hchart('column', hcaes(x = num_projects, y = num_donors)) %>%
  hc_navigator(enabled = TRUE, xAxis = list(labels = list(enabled = FALSE)))

Here we can see the number of donors falling away pretty exponentially compared to the number of projects, overall around 600k donors out of 2 million donors contribute to more than one project - with over half of this amount either 2 or 3 projects.

We could have a hunch that our marketing team want to hunt our multiple project donators for a reason though - let’s look at how much on average these groups donate overall!

df_donations_dist %>%
  filter(num_projects < 100) %>%
  hchart('column', hcaes(x = num_projects, y = average_donation)) %>%
  hc_navigator(enabled = TRUE, xAxis = list(labels = list(enabled = FALSE)))

Boomfa - look at that climbing staircase! - the more different projects people donate to, the more they will donate overall. The average seems to increase steadily (around $100 per project - perhaps the minimum donation amount?) and then plateaus areound $8k per donors.

Multiple donators map

df_state_summary = df_donations_summary %>%
  group_by(`Donor State`) %>%
  summarise(prop_rep_donations = 100*sum(repeat_donations)/n(),
            prop_rep_projects = 100*sum(repeat_projects)/n()) %>%
  inner_join(df_state_mapping, by = c('Donor State' = 'state_name')) %>%
  drop_na()
    
hcmap("countries/us/us-all", data = df_state_summary, value = "prop_rep_projects",
      joinBy = c("hc-a2", "state_code"), name = "Proportion Donating Multiple Projects",
      borderColor = "transparent") %>%
  hc_tooltip(valueDecimals = 1, valueSuffix = "%") %>%
  hc_colorAxis(stops = color_stops(6)) %>%
  #hc_colorAxis(dataClasses = color_classes(seq(min(df_state_summary$prop_rep_projects), max(df_state_summary$prop_rep_projects), length.out = 6))) %>% 
hc_legend(layout = "vertical", align = "right",
          floating = TRUE, valueDecimals = 0, valueSuffix = "%") 

Multiple donators over time

df_monthly_summary = df_donations_summary %>%
  mutate(first_month = as.Date(floor_date(first_donation, "month"))) %>%
  #mutate(first_month =  format(as.Date(first_donation), "%Y-%m")) %>%
  group_by(first_month) %>%
  summarise(prop_rep_donations = 100*sum(repeat_donations)/n(),
            prop_rep_projects = 100*sum(repeat_projects)/n()) %>%
  ungroup() %>%
  gather(variable, value, -first_month)

df_monthly_summary %>%
  hchart("line", hcaes(x = first_month, y = value, group = variable)) %>%
  hc_navigator(enabled = TRUE) %>%
  hc_rangeSelector(enabled = TRUE) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 1, valueSuffix = "%")

States over time

df_monthly_state_summary = df_donations_summary %>%
  mutate(first_month = as.Date(floor_date(first_donation, "month"))) %>%
  group_by(first_month, `Donor State`) %>%
  summarise(prop_rep_donations = 100*sum(repeat_donations)/n(),
            prop_rep_projects = 100*sum(repeat_projects)/n()) %>%
  ungroup()
  #gather(variable, value, -first_month, -`Donor State`) %>%
  #mutate(donor_state = `Donor State`)

df_monthly_state_summary %>%
  hchart("heatmap", hcaes(y = first_month, x = `Donor State`, value = prop_rep_projects)) %>%
  hc_colorAxis(stops = color_stops(10)) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 1, valueSuffix = "%") %>%
  hc_yAxis(showLastLabel = FALSE, showFirstLabel = FALSE)

Donation times

Distribution of mean

df_donations_summary %>%
  mutate(mean_donation_gap = round(mean_donation_gap)) %>%
  group_by(mean_donation_gap) %>%
  summarise(num_donors = n()) %>%
  drop_na() %>%
  hchart("line", hcaes(x = mean_donation_gap, y = num_donors)) %>%
  hc_navigator(enabled = TRUE, xAxis = list(labels = list(enabled = FALSE)))
df_donations_summary %>%
  mutate(max_donation_gap = round(max_donation_gap)) %>%
  group_by(max_donation_gap) %>%
  summarise(num_donors = n()) %>%
  drop_na() %>%
  hchart("line", hcaes(x = max_donation_gap, y = num_donors)) %>%
  hc_navigator(enabled = TRUE, xAxis = list(labels = list(enabled = FALSE)))

Map

df_state_time = df_donations_summary %>%
  group_by(`Donor State`) %>%
  summarise(mean_donation_gap = mean(mean_donation_gap, na.rm = TRUE)) %>%
  inner_join(df_state_mapping, by = c('Donor State' = 'state_name')) %>%
  drop_na()
  
hcmap("countries/us/us-all", data = df_state_time, value = "mean_donation_gap",
      joinBy = c("hc-a2", "state_code"), name = "Mean time between donations",
      borderColor = "transparent") %>%
hc_tooltip(valueDecimals = 1) %>%
hc_colorAxis(stops = color_stops(6)) %>%
hc_legend(layout = "vertical", align = "right",
          floating = TRUE, valueDecimals = 0)  

Bonus - Where them schools at?

Both highcharts and the schools database has the county level data - why not plot it?

df_schools_joined <- df_schools %>% 
  left_join(df_state_mapping, by = c('School State' = 'state_name')) %>%
  left_join(df_county_mapping, by = c('School County' = 'county_name'))
df_schools_num_county = df_schools_joined %>%
  group_by(county_code) %>%
  summarise(num_schools = n()) %>%
  ungroup() %>%
  drop_na() %>%
  right_join(df_county_mapping) %>%
  replace_na(list(num_schools = 0))
## Joining, by = "county_code"
hcmap("countries/us/us-all-all", data = df_schools_num_county, value = "num_schools",
           joinBy = c("hc-key", "county_code"), name = "Number of Schools",
           borderColor = "transparent") %>%
  #hc_colorAxis() %>%
  hc_colorAxis(dataClasses = color_classes(c(1, 10, 100, 1000, max(df_schools_num_county$num_schools)))) %>%
  hc_title(text = "Where the schools at!?") %>%
  hc_legend(layout = "vertical", align = "right",
            floating = TRUE, valueDecimals = 0)